RapidMiner Studio documentation, version 10.2
Process Documents from Web (Web Mining)
Synopsis
This operator crawls the web and preprocesses each page before storing it, together with additional information, in an ExampleSet.
Description
This operator is similar to the Crawl Web operator, but can additionally extract information from web pages without first saving the complete page. This is more appropriate if you are going to crawl a large number of pages but discard most of their content.
An advanced use of this operator is crawling pages that consist of sub-parts you are interested in. You can cut the web page's document inside this operator using a Cut operator and deliver the whole collection of documents to the operator's inner sink. Each document in the collection becomes one example. Any meta information you have attached, for example with the Extract Information operator, is stored as additional attributes.
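The idea of cutting one page into several documents, each becoming one example with its own attributes, can be illustrated outside RapidMiner. The following Python sketch is not the operator's implementation: the function name, the regular expressions, and the attribute names `url`, `text`, and `title` are all hypothetical.

```python
import re

def cut_into_examples(url, page_text, segment_pattern, title_pattern):
    """Cut one page into segments and return one row (dict) per segment.

    segment_pattern selects the sub-parts of the page; title_pattern must
    contain one capture group and plays the role of extracted meta information.
    """
    rows = []
    for match in re.finditer(segment_pattern, page_text, flags=re.DOTALL):
        segment = match.group(0)
        title = re.search(title_pattern, segment)
        rows.append({
            "url": url,                                   # same for every segment
            "text": segment,                              # the cut-out sub-document
            "title": title.group(1) if title else None,   # additional attribute
        })
    return rows
```

Calling this with a page that contains two list items and the segment pattern `<li>.*?</li>` would yield two rows, one per item.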
The internal crawler starts at the specified URL, loads pages, and follows links as directed by the crawling rules. There are different types of rules, each applied in a different situation:
- store_with_matching_url: If the regular expression matches the URL, this page will be stored in the resulting ExampleSet.
- store_with_matching_content: If the regular expression matches the page content, this page will be stored in the resulting ExampleSet.
- follow_link_with_matching_url: If the regular expression matches the URL, the crawler will follow the link and load the URL.
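The three rule types can be read as two separate decisions per page: whether to follow a link, and whether to store a loaded page. The Python sketch below illustrates that reading under several assumptions that this page does not confirm: the rule list and its patterns are hypothetical, a partial match (`re.search`) counts as a match, and a page is stored if any store rule matches. RapidMiner uses Java regular expressions, so Python's `re` module is only an approximation.

```python
import re

# Hypothetical rule list: (rule_type, regular_expression) pairs.
RULES = [
    ("follow_link_with_matching_url", r"https://example\.com/docs/.*"),
    ("store_with_matching_url", r".*\.html"),
    ("store_with_matching_content", r"(?i)rapidminer"),
]

def should_follow(url, rules=RULES):
    """Return True if any follow rule matches the link URL."""
    return any(
        kind == "follow_link_with_matching_url" and re.search(pattern, url)
        for kind, pattern in rules
    )

def should_store(url, content, rules=RULES):
    """Return True if any store rule matches the page URL or its content."""
    for kind, pattern in rules:
        if kind == "store_with_matching_url" and re.search(pattern, url):
            return True
        if kind == "store_with_matching_content" and re.search(pattern, content):
            return True
    return False
```

Keeping the follow decision separate from the store decision means a page can be traversed for its links without being stored, and the other way round.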
To avoid crawling a potentially unlimited number of pages, limit the number of pages and the depth the crawler will retrieve with the parameters max pages and max crawl depth. Lowering the delay speeds up loading, but please be considerate of web site owners and avoid causing high traffic on their sites; otherwise you may get blacklisted. Note that although crawling makes use of your available CPU cores (license limits apply), crawling speed is usually limited by your bandwidth, the crawling delay, and the fact that this crawler is benign and queries robots.txt for each page it visits.
Leave the ignore robot exclusion parameter unchecked unless you are crawling your own sites. Some site owners forbid crawling of their content, and you may be legally bound to respect that.
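How a page limit, a depth limit, and a delay bound a crawl can be sketched with a breadth-first traversal. This is a simplified illustration, not the operator's algorithm: `get_links` is a hypothetical callback standing in for downloading and parsing a page, the page limit counts visited pages rather than stored ones, and the start page is assumed to have depth 0 so that its direct links have depth 1.

```python
import time
from collections import deque

def crawl(start_url, get_links, max_pages=10, max_depth=2, delay_ms=0):
    """Breadth-first crawl that stops at max_pages or max_depth.

    get_links(url) is a hypothetical callback returning the outgoing links of
    a page; a real crawler would download and parse the page here.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])  # the start page has depth 0
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)            # "load" the page
        time.sleep(delay_ms / 1000.0)  # politeness delay between page loads
        if depth >= max_depth:
            continue                   # do not follow links beyond max_depth
        for link in get_links(url):
            if link not in seen:       # never queue the same URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With `max_depth=1` the sketch visits only the start page and the pages it links to directly. Without either limit, a crawl of a site with generated links might never terminate.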
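To see what honoring the robot exclusion rules means in practice, the following Python sketch checks a URL against robots.txt content using the standard library's `urllib.robotparser`. It does not show how RapidMiner performs the check; the robots.txt content and the crawler name `my-crawler` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""
```

With this content, `is_allowed(EXAMPLE_ROBOTS, "my-crawler", "https://example.com/private/data.html")` reports that the page must not be fetched, while pages outside `/private/` remain allowed.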
Output
- example set (Data Table)
The example set port which returns the crawling results.
Parameters
- url: The root page from which the crawler will start.
- crawling_rules: Specifies a set of rules that determine which links to follow and which pages to process.
- retrieve_as_html: If selected, the actual HTML is returned instead of the textual representation.
- enable_basic_auth: If selected, all requests will send basic auth information in their header. Use only when crawling HTTPS pages!
- username: Username for basic authentication.
- password: Password for basic authentication.
- add_content_as_attribute: Specifies whether the pages' content should be added as a text attribute.
- max_crawl_depth: Specifies the maximal depth of the crawling process. A depth of 1 means 'only crawl direct links on the initial page'.
- max_pages: The maximal number of pages to store.
- max_page_size: Specifies the maximum page size (in KB): pages larger than this limit are not downloaded.
- delay: Specifies the delay when visiting a page in milliseconds.
- max_concurrent_connections: Maximum number of HTTP connections used at the same time.
- max_connections_per_host: Maximum number of simultaneous HTTP connections used to connect to a single host. Increasing this parameter can put heavy load on a host, so please be careful!
- user_agent: The identity the crawler uses while accessing a server.
- ignore_robot_exclusion: Specifies whether the crawler should ignore the robot exclusion rules set by the crawled page. Enable only for your own sites, otherwise you may end up violating laws!
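The advice to use enable_basic_auth only with HTTPS follows from how HTTP basic authentication works: the credentials are Base64-encoded in the `Authorization` header, not encrypted. The Python sketch below shows the encoding and how easily it is reversed. It illustrates the general scheme, not the operator's internals; the function names and the credentials `user` and `pass` are placeholders.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}

def decode_basic_auth(header_value):
    """Recover the credentials from a header value; no key or secret is needed."""
    _scheme, token = header_value.split(" ", 1)
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password
```

Because `decode_basic_auth` needs nothing but the header value, anyone who can read an unencrypted HTTP request can read the username and password.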